W5 Lab Assignment

This lab covers some fundamental plots of 1-D data.



In [27]:

    
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set_style('white')

%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")

Q1 1-D Scatter Plot

Using fake data

Remember that you can experiment with visualization tools using fake data as well as real data. Fake data is a nice way to experiment because you control every aspect of it. Let's create some random numbers.

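The function np.random.randn(N) generates a sample of size $N$ from the standard normal distribution. Its sibling np.random.rand(N), which is the one printed below, draws from the uniform distribution on [0, 1) instead.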



In [28]:

    
print( np.random.rand(10) )









    



[ 0.92732254  0.44601266  0.59433485  0.39955111  0.54506167  0.95558355
  0.1375084   0.21573451  0.04771416  0.15855766]
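np.random.rand() and np.random.randn() are easy to confuse because their names differ by one letter. The following is a minimal sketch that contrasts them; the seed and the sample size are arbitrary choices made here for illustration, not part of the lab.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is reproducible

uniform = np.random.rand(10_000)   # uniform on [0, 1)
normal = np.random.randn(10_000)   # standard normal: mean 0, standard deviation 1

# Uniform samples never leave [0, 1); normal samples can be negative.
print(uniform.min() >= 0, uniform.max() < 1)
print(normal.min() < 0)

# The uniform mean sits near 0.5, the normal standard deviation near 1.
print(round(uniform.mean(), 2), round(normal.std(), 2))
```

If a printed sample contains only values between 0 and 1, it most likely came from rand() rather than randn().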

The following small function generates $N$ normally distributed numbers:



In [29]:

    
def generate_many_numbers(N=10, mean=5, sigma=3):
    return mean + sigma * np.random.randn(N)

Generate 10 normally distributed numbers with mean 5 and sigma 3:



In [30]:

    
data = generate_many_numbers(N=10)
print(data)









    



[-1.70232553  6.50831593  6.00320728  9.22047422  7.34258577  4.30483447
  1.82233298  1.58048897  4.40789351  0.86419455]
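With only 10 numbers it is hard to tell whether the mean and sigma arguments did anything. As a rough check, and assuming an arbitrary seed and a much larger N than the lab uses, the sample mean and standard deviation should land close to the requested values:

```python
import numpy as np

def generate_many_numbers(N=10, mean=5, sigma=3):
    # Shift and scale standard normal samples, as in the cell above.
    return mean + sigma * np.random.randn(N)

np.random.seed(42)  # arbitrary seed, for reproducibility only

big_sample = generate_many_numbers(N=100_000)
print(big_sample.mean())  # should be close to 5
print(big_sample.std())   # should be close to 3
```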

The most immediate method to visualize 1-D data is just plotting it. Here we can use the scatter() function to draw a scatter plot. The most basic usage of this function is to provide x and y.



In [31]:

    
x = np.arange(1,11)
y = x + 5
print(x)
print(y)
plt.scatter(x, y)









    



[ 1  2  3  4  5  6  7  8  9 10]
[ 6  7  8  9 10 11 12 13 14 15]






    Out[31]:





<matplotlib.collections.PathCollection at 0x7f9d5738fa90>

But here we only have x (the generated data). We can set the y values to 0. The np.zeros_like(data) function creates a NumPy array of zeros that has the same shape and dtype as its argument.



In [32]:

    
print(np.zeros_like(data))









    



[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
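A small sketch of what "same shape" means in practice. The arrays here are made up for illustration; the point is that np.zeros_like() copies both the shape and the dtype of whatever it is given.

```python
import numpy as np

floats = np.array([1.5, 2.5, 3.5])
ints = np.arange(6).reshape(2, 3)   # a 2 x 3 integer array

zeros_f = np.zeros_like(floats)
zeros_i = np.zeros_like(ints)

print(zeros_f.shape, zeros_f.dtype)  # same shape and dtype as `floats`
print(zeros_i.shape, zeros_i.dtype)  # same shape and dtype as `ints`
```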

Now let's plot the generated 1-D data.



In [33]:

    
plt.figure(figsize=(10,1)) # set figure size, width = 10, height = 1
plt.scatter(data, np.zeros_like(data), s=50) # set size of symbols to 50. Change it and see what happens. 
plt.gca().axes.get_yaxis().set_visible(False) # set y axis invisible

Ok, I think we can see all data points. But what if we have more numbers?



In [34]:

    
# TODO: generate 100 numbers and plot them in the same way. 
data = generate_many_numbers(N=100)

plt.figure(figsize=(10,1))
plt.scatter(data, np.zeros_like(data), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)

Of course we can't see much at the center, where the points overlap. We can add vertical "jitter" using the np.random.rand() function, which spreads the points along the y axis.



In [35]:

    
data = generate_many_numbers(N=100)

# TODO: create a list of 100 random numbers using np.random.rand()
# zittered_ypos = ??

zittered_ypos = np.random.rand(100)

plt.figure(figsize=(10,1))
plt.scatter(data, zittered_ypos, s=50)
plt.gca().axes.get_yaxis().set_visible(False)

Let's also make the symbol transparent. Here is a useful Google query, and the documentation of scatter() also helps.



In [36]:

    
data = generate_many_numbers(N=200)

# From the last question
# zittered_ypos = ??

# TODO: implement this
# plt.figure(figsize=(10,1))
# plt.scatter( ?? )
# plt.gca().axes.get_yaxis().set_visible(False)
# TODO: implement this
zittered_ypos = np.random.rand(200)
plt.figure(figsize=(10,1))
plt.scatter(data, zittered_ypos, s = 50, alpha = 0.35)
plt.gca().axes.get_yaxis().set_visible(False)
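The jitter and transparency steps can be bundled into one small function. This is only a sketch: jittered_scatter is a hypothetical helper name introduced here, and the seeds and alpha value are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def jittered_scatter(data, alpha=1.0, seed=None):
    """Draw a 1-D scatter plot with uniform vertical jitter (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    ypos = rng.random(len(data))           # one jitter value in [0, 1) per point
    fig, ax = plt.subplots(figsize=(10, 1))
    ax.scatter(data, ypos, s=50, alpha=alpha)
    ax.get_yaxis().set_visible(False)      # the y position carries no information
    return ax

rng = np.random.default_rng(0)
sample = 5 + 3 * rng.standard_normal(100)  # same recipe as generate_many_numbers
ax = jittered_scatter(sample, alpha=0.5, seed=1)
```

Returning the axes makes it easy to tweak the plot afterwards, for example with ax.set_xlabel().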

We can use transparency as well as empty symbols.

Increase the number of points to 1,000
Set the symbol empty and edgecolor red (a useful query)



In [37]:

    
# TODO: implement this
# data = ?? 
# zittered_ypos = ??


# TODO: implement this
# plt.figure(figsize=(10,1))
# plt.scatter( ?? )
# plt.gca().axes.get_yaxis().set_visible(False)


data = generate_many_numbers(N=1000)
zittered_ypos = np.random.rand(1000)
plt.figure(figsize=(10,1))
# facecolors='none' leaves the symbols truly empty instead of painting them white
plt.scatter(data, zittered_ypos, s=50, facecolors='none', edgecolors='r')
plt.gca().axes.get_yaxis().set_visible(False)

Lots and lots of points

Let's use real data. Load the IMDb dataset that we used before.



In [38]:

    
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()









    Out[38]:

         Title  Year  Rating  Votes
0       !Next?  1994     5.4      5
1    #1 Single  2006     6.1     61
2  #7DaysLater  2013     7.1     14
3   #Bikerlive  2014     6.8     11
4    #ByMySide  2012     5.5     13

Try to plot the 'Rating' information using a 1-D scatter plot. Does it work?



In [39]:

    
# TODO: plot 'rating'

rating = movie_df['Rating'].values
plt.figure(figsize=(10,1)) 
plt.scatter(rating, np.zeros_like(rating), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)

Q2 Histogram

There are too many data points! Let's try a histogram. pandas supports plotting through matplotlib, so you can visualize dataframes and series directly.



In [40]:

    
movie_df['Rating'].hist()









    Out[40]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9d5392f358>

Looks good! Can you increase or decrease the number of bins? Find the documentation here.



In [41]:

    
# TODO: try different number of bins
movie_df['Rating'].hist(bins = 30)









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9d538db5f8>



In [42]:

    
movie_df['Rating'].hist(bins = 20)









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9d53806a90>
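To see what the bins argument changes numerically, np.histogram() returns the counts and bin edges without drawing anything. The sketch below uses made-up uniform numbers as a stand-in for movie_df['Rating'], so the counts themselves mean nothing; only the relationship between bins, edges, and bin width matters.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.uniform(1, 10, size=1000)  # made-up stand-in for the real ratings

for bins in (10, 20, 30):
    counts, edges = np.histogram(ratings, bins=bins)
    width = edges[1] - edges[0]
    # There is always one more edge than there are bins, and every
    # data point is counted exactly once.
    print(bins, len(edges), round(width, 3), counts.sum())
```

More bins means narrower bins and fewer points in each, which shows finer detail but also more noise. pandas' hist() uses 10 bins when none is given.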

Q3 Boxplot

Now let's try a boxplot. We can use pandas' plotting functions; see the boxplot documentation for usage.



In [43]:

    
movie_df['Rating'].plot(kind='box', vert=False)









    Out[43]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9d53875a20>

Or try seaborn's boxplot() function:



In [44]:

    
sns.boxplot(x=movie_df['Rating'])  # pass the data by keyword; recent seaborn versions expect this









    Out[44]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f9d56b4f6a0>
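It helps to know which numbers a boxplot draws. The sketch below computes them by hand for the five ratings shown in movie_df.head(), assuming the common convention (the default in matplotlib and seaborn, as far as I know) that whiskers extend to the most extreme points within 1.5 times the interquartile range of the box.

```python
import numpy as np

values = np.array([5.4, 6.1, 7.1, 6.8, 5.5])  # ratings from movie_df.head()

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data points that lie inside the fences.
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```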

We can also easily draw a series of boxplots grouped by categories. For example, let's do the boxplots of movie ratings for different decades.



In [45]:

    
df = movie_df.sort_values('Year')  # DataFrame.sort() was removed from pandas; use sort_values()
df.head()









    Out[45]:

                                Title  Year  Rating  Votes
215207               Passage de Venus  1874     6.5    174
234798     Sallie Gardner at a Gallop  1878     7.3    452
186796  Man Walking Around the Corner  1887     5.1    365
57131                Accordion Player  1888     5.7    433
232543          Roundhay Garden Scene  1888     7.7   3451

One easy way to map a particular year to its decade (e.g., 1874 -> 1870): integer-divide it by 10, which drops the last digit, and multiply the result by 10 again.

In Python 3, the // operator is used for integer division.



In [46]:

    
print(1874//10)
print(1874//10*10)
decade = (df['Year']//10) * 10
decade.head()









    



187
1870






    Out[46]:





215207    1870
234798    1870
186796    1880
57131     1880
232543    1880
Name: Year, dtype: int64
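The same decade series can also serve as a grouping key for numeric summaries, which is roughly what the grouped boxplot shows graphically. A minimal sketch, using only the five earliest rows shown above rather than the full dataset:

```python
import pandas as pd

# The five rows from df.head() above; the real df has many more.
sample = pd.DataFrame({
    'Year':   [1874, 1878, 1887, 1888, 1888],
    'Rating': [6.5, 7.3, 5.1, 5.7, 7.7],
})

decade = (sample['Year'] // 10) * 10

# Median rating per decade: the line drawn inside each box of a grouped boxplot.
summary = sample.groupby(decade)['Rating'].median()
print(summary)
```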



In [47]:

    
ax = sns.boxplot(x=decade, y=df['Rating'])
ax.figure.set_size_inches(12, 8)

Can you draw boxplots of movie votes for different decades?



In [48]:

    
# TODO
ax = sns.boxplot(x=decade, y=df['Votes'])
ax.figure.set_size_inches(12, 8)

What do you see? Can you actually see the "box"? The number of votes spans a very wide range, from 1 to more than 1.4 million. One way to deal with this is to log-transform the votes, which can be done with the numpy.log() function (the natural logarithm).



In [49]:

    
log_votes = np.log(df['Votes'])
log_votes.head()









    Out[49]:





215207    5.159055
234798    6.113682
186796    5.899897
57131     6.070738
232543    8.146419
Name: Votes, dtype: float64
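A short sketch of the transformation on the five vote counts shown above. np.log() is the natural logarithm and np.exp() undoes it; np.log10() is a common alternative whose values are easier to read as orders of magnitude. Note that the logarithm of 0 is not finite, so this only works because every movie here has at least one vote.

```python
import numpy as np

votes = np.array([174, 452, 365, 433, 3451])  # from the output above

log_votes = np.log(votes)      # natural log, as in the cell above
recovered = np.exp(log_votes)  # inverse transform gives the votes back
log10_votes = np.log10(votes)  # base 10: 3451 votes falls between 3 and 4

print(log_votes)
print(recovered)
print(log10_votes)
```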

Can you draw boxplots of log-transformed movie votes for different decades?



In [50]:

    
# TODO
ax = sns.boxplot(x=decade, y = log_votes)
ax.figure.set_size_inches(12, 8)


